On the Ability of Graph Neural Networks to Model Interactions Between Vertices
Graph neural networks (GNNs) are widely used for modeling complex
interactions between entities represented as vertices of a graph. Despite
recent efforts to theoretically analyze the expressive power of GNNs, a formal
characterization of their ability to model interactions is lacking. The current
paper aims to address this gap. Formalizing strength of interactions through an
established measure known as separation rank, we quantify the ability of
certain GNNs to model interaction between a given subset of vertices and its
complement, i.e. between sides of a given partition of input vertices. Our
results reveal that the ability to model interaction is primarily determined by
the partition's walk index -- a graph-theoretical characteristic that we define
by the number of walks originating from the boundary of the partition.
Experiments with common GNN architectures corroborate this finding. As a
practical application of our theory, we design an edge sparsification algorithm
named Walk Index Sparsification (WIS), which preserves the ability of a GNN to
model interactions when input edges are removed. WIS is simple, computationally
efficient, and markedly outperforms alternative methods in terms of induced
prediction accuracy. More broadly, it showcases the potential of improving GNNs
by theoretically analyzing the interactions they can model.
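The abstract defines a partition's walk index via the number of walks originating from the partition's boundary. As a minimal sketch of one plausible reading (the function name and the exact boundary/walk conventions are my assumptions, not taken from the paper), the quantity can be computed from powers of the adjacency matrix:

```python
import numpy as np

def walk_index(adj, subset, length):
    """Count length-`length` walks originating at the partition boundary.

    Hypothetical reading: the boundary of the partition (subset vs. its
    complement) is the set of vertices incident to a crossing edge, and the
    walk index counts walks of a given length starting there. `adj` is a
    dense 0/1 adjacency matrix.
    """
    n = adj.shape[0]
    subset = set(subset)
    # Boundary: vertices with at least one edge crossing the partition.
    boundary = [
        v for v in range(n)
        if any(adj[v, u] and ((u in subset) != (v in subset)) for u in range(n))
    ]
    # Row v of adj^length counts length-`length` walks starting at v.
    walks = np.linalg.matrix_power(adj, length)
    return int(walks[boundary].sum())
```

Under this reading, an edge sparsifier in the spirit of WIS would greedily remove the edge whose removal least decreases such walk indexes across the partitions of interest.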
Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks
In the pursuit of explaining implicit regularization in deep learning,
prominent focus was given to matrix and tensor factorizations, which correspond
to simplified neural networks. It was shown that these models exhibit an
implicit tendency towards low matrix and tensor ranks, respectively. Drawing
closer to practical deep learning, the current paper theoretically analyzes the
implicit regularization in hierarchical tensor factorization, a model
equivalent to certain deep convolutional neural networks. Through a dynamical
systems lens, we overcome challenges associated with hierarchy, and establish
implicit regularization towards low hierarchical tensor rank. This translates
to an implicit regularization towards locality for the associated convolutional
networks. Inspired by our theory, we design explicit regularization
discouraging locality, and demonstrate its ability to improve the performance
of modern convolutional networks on non-local tasks, in defiance of
conventional wisdom by which architectural changes are needed. Our work
highlights the potential of enhancing neural networks via theoretical analysis
of their implicit regularization.
Comment: Accepted to ICML 202
Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding
Recent state-of-the-art natural language understanding models, such as BERT
and XLNet, score a pair of sentences (A and B) using multiple cross-attention
operations - a process in which each word in sentence A attends to all words in
sentence B and vice versa. As a result, computing the similarity between a
query sentence and a set of candidate sentences requires the propagation of
all query-candidate sentence-pairs throughout a stack of cross-attention
layers. This exhaustive process becomes computationally prohibitive when the
number of candidate sentences is large. In contrast, sentence embedding
techniques learn a sentence-to-vector mapping and compute the similarity
between the sentence vectors via simple elementary operations. In this paper,
we introduce Distilled Sentence Embedding (DSE) - a model that is based on
knowledge distillation from cross-attentive models, focusing on sentence-pair
tasks. The outline of DSE is as follows: Given a cross-attentive teacher model
(e.g. a fine-tuned BERT), we train a sentence embedding based student model to
reconstruct the sentence-pair scores obtained by the teacher model. We
empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair
tasks. DSE significantly outperforms several ELMo variants and other sentence
embedding methods, while accelerating computation of the query-candidate
sentence-pairs similarities by several orders of magnitude, with an average
relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE
produces sentence embeddings that reach state-of-the-art performance on
universal sentence representation benchmarks. Our code is made publicly
available at https://github.com/microsoft/Distilled-Sentence-Embedding.
Comment: In Proceedings of AAAI 202
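The distillation objective described above (a student sentence encoder trained to reconstruct the teacher's pair scores) can be sketched as follows. The feature composition and function names are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def student_pair_score(u, v, w):
    """Hypothetical student scorer: a linear head over the features
    [u; v; |u - v|; u * v], a composition commonly used when classifying
    sentence pairs from independently encoded sentence vectors."""
    feats = np.concatenate([u, v, np.abs(u - v), u * v])
    return float(feats @ w)

def distillation_loss(u, v, w, teacher_score):
    """Squared error between the student's pair score and the score the
    cross-attentive teacher (e.g. a fine-tuned BERT) assigns to the same
    sentence pair -- the reconstruction target described in the abstract."""
    return (student_pair_score(u, v, w) - teacher_score) ** 2
```

The efficiency gain comes from the fact that candidate sentence vectors can be precomputed once; only the cheap head runs per query-candidate pair, rather than a full stack of cross-attention layers.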
What Makes Data Suitable for a Locally Connected Neural Network? A Necessary and Sufficient Condition Based on Quantum Entanglement
The question of what makes a data distribution suitable for deep learning is
a fundamental open problem. Focusing on locally connected neural networks (a
prevalent family of architectures that includes convolutional and recurrent
neural networks as well as local self-attention models), we address this
problem by adopting theoretical tools from quantum physics. Our main
theoretical result states that a certain locally connected neural network is
capable of accurate prediction over a data distribution if and only if the data
distribution admits low quantum entanglement under certain canonical partitions
of features. As a practical application of this result, we derive a
preprocessing method for enhancing the suitability of a data distribution to
locally connected neural networks. Experiments with widespread models over
various datasets demonstrate our findings. We hope that our use of quantum
entanglement will encourage further adoption of tools from physics for formally
reasoning about the relation between deep learning and real-world data.
Comment: Accepted to NeurIPS 202
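Entanglement under a partition of features has a standard computational form: matricize the tensor with one side of the partition as rows, and take the entropy of the normalized squared singular values. A minimal sketch (the function name and normalization choices are assumptions; the paper's canonical partitions and estimator may differ):

```python
import numpy as np

def entanglement_entropy(tensor, left_axes):
    """Entanglement entropy of a tensor under a feature partition:
    matricize with `left_axes` as rows, take singular values, and compute
    the entropy of their normalized squares. Low entropy under the
    relevant partitions corresponds to low quantum entanglement."""
    left_axes = list(left_axes)
    right_axes = [a for a in range(tensor.ndim) if a not in left_axes]
    rows = int(np.prod([tensor.shape[a] for a in left_axes]))
    mat = np.transpose(tensor, left_axes + right_axes).reshape(rows, -1)
    s = np.linalg.svd(mat, compute_uv=False)
    p = s**2 / np.sum(s**2)       # normalized squared singular values
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```

A rank-1 (unentangled) tensor gives entropy 0, while a maximally entangled one (e.g. a scaled identity matricization) gives the maximal value log of the matricization rank.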
What Algorithms can Transformers Learn? A Study in Length Generalization
Large language models exhibit surprising emergent generalization properties,
yet also struggle on many simple reasoning tasks such as arithmetic and parity.
This raises the question of whether and when Transformer models can learn the true
algorithm for solving a task. We study the scope of Transformers' abilities in
the specific setting of length generalization on algorithmic tasks. Here, we
propose a unifying framework to understand when and how Transformers can
exhibit strong length generalization on a given task. Specifically, we leverage
RASP (Weiss et al., 2021) -- a programming language designed for the
computational model of a Transformer -- and introduce the RASP-Generalization
Conjecture: Transformers tend to length generalize on a task if the task can be
solved by a short RASP program which works for all input lengths. This simple
conjecture remarkably captures most known instances of length generalization on
algorithmic tasks. Moreover, we leverage our insights to drastically improve
generalization performance on traditionally hard tasks (such as parity and
addition). On the theoretical side, we give a simple example where the
"min-degree-interpolator" model of learning from Abbe et al. (2023) does not
correctly predict Transformers' out-of-distribution behavior, but our
conjecture does. Overall, our work provides a novel perspective on the
mechanisms of compositional generalization and the algorithmic capabilities of
Transformers.
Comment: Preprint
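The conjecture's key predicate is that a task be solvable by a short program that works for all input lengths. Parity illustrates this: emitted in one shot it is hard, but with a scratchpad the running-XOR algorithm is short and length-independent. The sketch below is plain Python, not actual RASP, and the scratchpad format is an illustrative assumption rather than the paper's exact one:

```python
def parity_scratchpad(bits):
    """Length-independent parity: emit the running XOR at each position,
    mimicking a scratchpad format under which parity admits a short,
    length-uniform program. The final element is the parity of the
    whole sequence."""
    out, acc = [], 0
    for b in bits:
        acc ^= b
        out.append(acc)
    return out
```

Because the same loop body applies at every position regardless of sequence length, a model that learns this program has, by the conjecture, a route to strong length generalization.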
Vanishing Gradients in Reinforcement Finetuning of Language Models
Pretrained language models are commonly aligned with human preferences and
downstream tasks via reinforcement finetuning (RFT), which entails maximizing a
(possibly learned) reward function using policy gradient algorithms. This work
highlights a fundamental optimization obstacle in RFT: we prove that the
expected gradient for an input vanishes when its reward standard deviation
under the model is small, even if the expected reward is far from optimal.
Through experiments on an RFT benchmark and controlled environments, as well as
a theoretical analysis, we then demonstrate that vanishing gradients due to
small reward standard deviation are prevalent and detrimental, leading to
extremely slow reward maximization. Lastly, we explore ways to overcome
vanishing gradients in RFT. We find the common practice of an initial
supervised finetuning (SFT) phase to be the most promising candidate, which
sheds light on its importance in an RFT pipeline. Moreover, we show that a
relatively small number of SFT optimization steps on as few as 1% of the input
samples can suffice, indicating that the initial SFT phase need not be
expensive in terms of compute and data labeling efforts. Overall, our results
emphasize that being mindful of inputs whose expected gradient vanishes, as
measured by the reward standard deviation, is crucial for the successful
execution of RFT.
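The vanishing-gradient phenomenon above can be made concrete for a softmax policy over discrete actions, where the expected REINFORCE gradient has a closed form. This is a minimal numerical sketch (a single-step bandit simplification of my own, not the paper's setting or proof):

```python
import numpy as np

def expected_reinforce_grad(logits, rewards):
    """Exact expected REINFORCE gradient of E[r] w.r.t. the logits of a
    softmax policy over discrete actions: E[r(a) * grad log pi(a)].
    Since grad log pi(a) = onehot(a) - p, this is
    sum_a p_a * r_a * (onehot(a) - p)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = np.zeros_like(p)
    for a, r in enumerate(rewards):
        onehot = np.zeros_like(p)
        onehot[a] = 1.0
        grad += p[a] * r * (onehot - p)
    return grad
```

When the policy is concentrated on one action, the reward standard deviation under the model is tiny and the expected gradient is nearly zero, even if the expected reward is far from optimal (e.g. the model concentrates on a zero-reward action while other actions yield reward 1), matching the obstacle the abstract describes.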